Intro to Pandas
Introduction
We will start our data science journey by learning a bit about the most useful Python library for this class: Pandas. As a reminder, a library is a set of tools we load on top of Python that provides new functionalities for a specific problem or type of analysis. Here, Pandas provides functions for data manipulation and analysis, handling structured data like tables or time series and facilitating numerous tasks you might encounter as a scientist. These include:
- Reading/writing data from various commonly-used formats (CSV, Excel, SQL, JSON, etc.)
- Handling missing data
- Filtering, sorting, reshaping and grouping data
- Aggregating data (sum, mean, count, etc.)
- Time series support (date ranges, frequency conversions)
- Statistical operations
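To make a few of these tasks concrete, here is a minimal sketch that reads a small table, handles a missing value, and aggregates by group. The CSV content, the column names (`site`, `temperature`) and the variable names are invented for illustration; in practice you would pass a file path to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """site,temperature
A,12.0
A,
B,15.0
B,17.0
"""

# Read the table; the empty cell becomes a missing value (NaN).
df = pd.read_csv(io.StringIO(csv_text))

# Handle missing data: count the gaps, then drop the incomplete rows.
n_missing = int(df["temperature"].isna().sum())
clean = df.dropna(subset=["temperature"])

# Group the rows by site and aggregate each group with the mean.
mean_by_site = clean.groupby("site")["temperature"].mean()
print(mean_by_site)
```

Dropping incomplete rows is only one possible choice here; depending on the analysis, filling the gaps (e.g., with `fillna`) may be more appropriate.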
Today’s objectives
This class is by no means intended to make you an expert in Pandas and data science. Rather, the objective is to take you through the most basic manipulations so that you build the confidence to keep exploring scientific coding and to include it in your research pipeline.
We will start by reviewing the data structures behind Pandas, then move on to a coding exercise to familiarize you with some basic functionality.
Pandas data structure
Pandas provides two main data structures. Let’s make an analogy with Excel.
- Series: A 1D labeled array. Think of a two-column Excel spreadsheet in which the left column contains a label (e.g., the time of a measurement) and the right column contains a value (e.g., the temperature of a river measured at that time).
- DataFrame: A 2D labeled table. This is like an Excel spreadsheet with more columns than a Series, where each column holds measurements of a different variable (e.g., the flow rate, the turbidity, etc.).
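As a minimal sketch of the two structures, the snippet below builds a Series and a DataFrame that share the same time labels. All values and the names `water_temperature`, `flow_rate` and `turbidity` are made up for illustration.

```python
import pandas as pd

# Hypothetical measurement times, used as row labels.
times = pd.to_datetime(
    ["2021-06-15 14:00", "2021-06-15 15:00", "2021-06-15 16:00"]
)

# Series: one column of values, each attached to a label (here, a time).
water_temperature = pd.Series(
    [14.2, 14.8, 15.1], index=times, name="water_temperature"
)

# DataFrame: several labeled columns sharing the same row labels.
river = pd.DataFrame(
    {
        "water_temperature": [14.2, 14.8, 15.1],
        "flow_rate": [3.1, 3.4, 3.3],
        "turbidity": [5.0, 6.2, 5.7],
    },
    index=times,
)

print(water_temperature.ndim)  # 1 (one dimension)
print(river.ndim)  # 2 (two dimensions)
```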
The keyword here is labeled. In Excel, you refer to columns by letters and to rows by numbers. In Pandas, you can use the column name (e.g., water_temperature) or the row label (e.g., 2021-06-15 14:19:14).
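Label-based access can be sketched as follows. The table and its values are hypothetical; the row label is wrapped in `pd.Timestamp` so that it matches the time-based row labels exactly.

```python
import pandas as pd

# Hypothetical table with time-based row labels.
river = pd.DataFrame(
    {"water_temperature": [14.2, 14.8], "flow_rate": [3.1, 3.4]},
    index=pd.to_datetime(["2021-06-15 14:19:14", "2021-06-15 15:19:14"]),
)

label = pd.Timestamp("2021-06-15 14:19:14")

# Select a whole column by its name.
temps = river["water_temperature"]

# Select a whole row by its label.
row = river.loc[label]

# Select a single value by row label and column name.
value = river.loc[label, "water_temperature"]
print(value)  # 14.2
```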
Throughout this class we will focus on DataFrames rather than Series. Keep in mind that the two behave almost identically.
Anatomy of a DataFrame
Figure 1 shows the basic anatomy of a DataFrame that contains four rows and four columns. Some structure already emerges:
- Rows tend to represent entries, which can be:
  - Different measurements at specific time steps
  - Different samples collected at different places/times
  - etc.
- In contrast, columns represent attributes and store the properties of each entry:
  - The actual values of different measured parameters
  - The location and time of collected samples, along with associated analyses (e.g., geochemistry)
  - etc.
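The rows-as-entries, columns-as-attributes layout can be illustrated with a small table of four rows and four columns. The samples below are invented for illustration and are not the data shown in Figure 1.

```python
import pandas as pd

# Hypothetical samples: each row is one entry, each column one attribute.
samples = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4"],
        "location": ["upstream", "upstream", "downstream", "downstream"],
        "date": ["2021-06-15", "2021-06-16", "2021-06-15", "2021-06-16"],
        "SiO2_wt": [51.2, 50.8, 63.4, 62.9],
    }
)

# shape gives (number of entries, number of attributes).
n_rows, n_columns = samples.shape

# One row = one entry, with a value for every attribute.
first_entry = samples.iloc[0]

# One column = one attribute, with a value for every entry.
silica = samples["SiO2_wt"]

print(n_rows, n_columns)  # 4 4
```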
The first row, i.e. the row containing the column labels, is not considered an entry, because the top row of a DataFrame is usually used to label the columns. Similarly, we might want to set the first column as a label for the rows (Figure 2). In a nutshell:
- Index refers to the labels of the rows. Index values are usually unique, meaning that each entry has a different label.
- Column refers to the labels of the columns.
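A minimal sketch of these two sets of labels is shown below, assuming a hypothetical table whose first column (`sample_id`) is promoted to row labels with `set_index`.

```python
import pandas as pd

# Hypothetical table; names and values are invented for illustration.
samples = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "location": ["upstream", "midstream", "downstream"],
        "pH": [7.1, 7.4, 6.9],
    }
)

# By default, rows are labeled 0, 1, 2, ...
default_labels = list(samples.index)

# Use the first column as the row labels instead.
samples = samples.set_index("sample_id")

row_labels = list(samples.index)  # ['S1', 'S2', 'S3']
column_labels = list(samples.columns)  # ['location', 'pH']
labels_are_unique = samples.index.is_unique  # True
```

Note that once `sample_id` becomes the index, it no longer appears among the columns.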
Remember that in Python, indexing starts from 0, so the first row or column is at position 0.
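Position-based access uses `iloc`, which counts from 0 regardless of the labels. The sketch below uses a made-up table with the row labels "a", "b" and "c".

```python
import pandas as pd

# Hypothetical table with non-numeric row labels.
df = pd.DataFrame(
    {"flow_rate": [3.1, 3.4, 3.3], "turbidity": [5.0, 6.2, 5.7]},
    index=["a", "b", "c"],
)

first_row = df.iloc[0]  # row at position 0, labeled "a"
first_column = df.iloc[:, 0]  # column at position 0, "flow_rate"
top_left = df.iloc[0, 0]  # value at row 0, column 0
first_column_name = df.columns[0]

print(top_left)  # 3.1
```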